Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ObjectStoreLocationProvider by default #1509

Merged

Conversation

smaheshwar-pltr
Copy link
Contributor

Following the discussion in #1452 (comment), this PR proposes diverging from the historical, Java-side preference of disabling object storage location provision by default, and proposes setting the OBJECT_STORE_ENABLED_DEFAULT table property default to True in PyIceberg.

@@ -297,8 +297,8 @@ def test_object_storage_location_provider_excludes_partition_path(
tbl = _create_table(
session_catalog=session_catalog,
identifier=f"default.arrow_table_v{format_version}_with_null_partitioned_on_col_{part_col}",
# write.object-storage.partitioned-paths defaults to True
properties={"format-version": str(format_version), TableProperties.OBJECT_STORE_ENABLED: True},
# Both write.object-storage.enabled and write.object-storage.partitioned-paths default to True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just changing this one integration test to test that it's now the default; I've kept test_writes::test_object_storage_data_files unchanged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wydt about adding assert statements as well? comments can become stale over time

Copy link
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, done

@smaheshwar-pltr
Copy link
Contributor Author

smaheshwar-pltr commented Jan 11, 2025

Want to strongly highlight for folks reviewing (@kevinjqliu + @Fokko?) that I've not changed WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT to False even though it's a similar change that intends to improves object store performance. I'm very open to discuss/change this (I'll update this PR name + description or put up a new one if so).

So do we want to do this? My thinking was: after reading through apache/iceberg#11112 it seems there was a strong case for supporting partition values in paths. When I thought about that, I realised that users might be relying on storage layout and files constituting partitions lying in the same directory - maybe just so that they can inspect storage to see how their files are actually laid out at a given point in time, especially in the context of partition evolution.

On the other hand, maybe we should make this change to discourage reliance on file storage layout. (It would also bring object storage performance improvements by default which is a plus)

@smaheshwar-pltr
Copy link
Contributor Author

smaheshwar-pltr commented Jan 11, 2025

The main downside of this PR is that file paths and their directory structures are more obscured by default with a bunch of hash directories inserted in the middle. But IMHO this is a reasonable tradeoff (and doesn't feel as dramatic as excluding partition information from file paths entirely). It won't change existing data file paths and they'll still work fine.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

In terms of the default value for WRITE_OBJECT_STORE_PARTITIONED_PATHS, here are the four different scenarios. (source)

  1. WRITE_OBJECT_STORE_PARTITIONED_PATHS=True for Partitioned Table: f"{partition_key.to_path()}/{data_file_name}"
  2. WRITE_OBJECT_STORE_PARTITIONED_PATHS=True for NonPartitioned Table: f"{prefix}/{hashed_path}/{data_file_name}"
  3. WRITE_OBJECT_STORE_PARTITIONED_PATHS=False for Partitioned Table: f"{prefix}/{hashed_path}-{data_file_name}"
  4. WRITE_OBJECT_STORE_PARTITIONED_PATHS=False for NonPartitioned Table: f"{prefix}/{hashed_path}-{data_file_name}"

Option 1 aligns with the current SimpleLocationProvider implementation for Partitioned Table. I think there is some benefits of having the partition_key.to_path() as the default since it's easier to reason about than a bunch of hashes. Furthermore, I look at ObjectStoreLocationProvider as S3/Object Store optimization and should only be turned on in specific scenarios.

My preference would be to keep WRITE_OBJECT_STORE_PARTITIONED_PATHS=True to satisfy option 1.

@@ -190,7 +190,7 @@ class TableProperties:
WRITE_PY_LOCATION_PROVIDER_IMPL = "write.py-location-provider.impl"

OBJECT_STORE_ENABLED = "write.object-storage.enabled"
OBJECT_STORE_ENABLED_DEFAULT = False
OBJECT_STORE_ENABLED_DEFAULT = True # Differs from the Java implementation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instead of the comment here, let's add it to the docs

Copy link
Contributor Author

@smaheshwar-pltr smaheshwar-pltr Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I've remove for now.

Then we can add all the new location-relevant write options introduced in #1452 when we address #1510 together with the other docs?

@@ -297,8 +297,8 @@ def test_object_storage_location_provider_excludes_partition_path(
tbl = _create_table(
session_catalog=session_catalog,
identifier=f"default.arrow_table_v{format_version}_with_null_partitioned_on_col_{part_col}",
# write.object-storage.partitioned-paths defaults to True
properties={"format-version": str(format_version), TableProperties.OBJECT_STORE_ENABLED: True},
# Both write.object-storage.enabled and write.object-storage.partitioned-paths default to True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wydt about adding assert statements as well? comments can become stale over time

@smaheshwar-pltr
Copy link
Contributor Author

smaheshwar-pltr commented Jan 11, 2025

In terms of the default value for WRITE_OBJECT_STORE_PARTITIONED_PATHS, here are the four different scenarios. (source)

  1. WRITE_OBJECT_STORE_PARTITIONED_PATHS=True for Partitioned Table: f"{partition_key.to_path()}/{data_file_name}"

Small correction @kevinjqliu - the function recurses on this line with this value being the new file name - it doesn't return it. So, hashes are still prepended to the file name after the prefix, and it's not the same as when SimpleLocationProvider is used.

To clarify, this isn't a bug. The ObjectStoreLocationProvider should include entropies to reduce prefix collision - that shouldn't change just because data is partitioned. I realise it was written confusingly, sorry about that. I matched the Java implementation here, and FWIW I suspect this confusing "recursive" (in Java, it's overloading not recursion) implementation is what led to what I've described here as a potential bug (/unintended result) on their part.

I'm realising maybe we should add a unit test for this case (Case 1) even though the Java implementation doesn't have one. We do have https://github.com/apache/iceberg-python/blob/main/tests/table/test_locations.py#L85-L86 but it doesn't test that entropies were injected for object storage. I've put up #1511.

Let me know if your thoughts regarding a potential WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT change given this.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! We'd probably want to document this but i see #1510 is opened to track

@kevinjqliu kevinjqliu requested a review from Fokko January 12, 2025 19:12
@Fokko Fokko merged commit c409678 into apache:main Jan 13, 2025
7 checks passed
@Fokko
Copy link
Contributor

Fokko commented Jan 13, 2025

Thanks for the quick follow up @smaheshwar-pltr, and thanks for the review @kevinjqliu 🙌

@smaheshwar-pltr smaheshwar-pltr deleted the smaheshwar/object-store-default-provider branch January 13, 2025 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants